
v11: Updates for startup and congestion#721

Merged
sdrabenh merged 3 commits into develop from feature/network-congestion
May 23, 2025


mathomp4 commented Apr 25, 2025

This PR adds a couple of updates to setup/run.

First, @AgilentGCMS found that setting FI_PSM3_CONN_TIMEOUT=120 helped with jobs on Discover that crashed with "psm3_ep_connect returned error Operation timed out". I don't see a downside to always setting it, so this PR does that.
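To make the "always set it" idea concrete, here is a minimal sketch in Python. It is not the change made in this PR (the actual setup/run scripts are not shown here); the helper name `job_environment` is hypothetical, and only the variable name and the value 120 come from the description above.

```python
import os


def job_environment(base=None):
    """Return a copy of the environment with the PSM3 connect timeout set.

    `base` defaults to the current process environment. The setting is
    applied unconditionally, mirroring "always setting" the variable.
    (Hypothetical helper, for illustration only.)
    """
    env = dict(os.environ if base is None else base)
    # Value taken from the PR description above.
    env["FI_PSM3_CONN_TIMEOUT"] = "120"
    return env
```

A launcher could then pass the result as `env=` to `subprocess.run`, so the setting reaches the MPI processes without modifying the parent shell.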

Second, I've encountered some job startup failures at NAS. Following their page on this, we already set MPI_LAUNCH_TIMEOUT=40, but this PR also adds the use of several_tries.
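For readers unfamiliar with the retry approach, the general idea of wrapping a launch command so it is attempted more than once can be sketched as below. This is an illustration only, not the NAS several_tries tool: the function name, `max_tries`, and `delay_s` are assumptions, and the sketch retries on any non-zero exit code, whereas a real tool may apply stricter criteria for what counts as a startup failure.

```python
import subprocess
import time


def launch_with_retries(cmd, max_tries=3, delay_s=0.0, run=subprocess.run):
    """Run `cmd`, retrying while it exits non-zero.

    Returns (returncode, attempts_used). `run` is injectable so the
    retry logic can be exercised without starting real processes.
    (Hypothetical helper; simplified retry criterion.)
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        result = run(cmd)
        if result.returncode == 0:
            return 0, attempt
        if attempt < max_tries:
            # Brief pause before the next attempt.
            time.sleep(delay_s)
    # All attempts failed; report the last exit code.
    return result.returncode, max_tries
```

Called as `launch_with_retries(["mpiexec", "-np", "4", "./a.out"])`, it would attempt the launch up to three times; the command shown is a placeholder, not the one used in the run scripts.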

mathomp4 added the 0 diff The changes in this pull request have verified to be zero-diff with the target branch. label Apr 25, 2025
mathomp4 self-assigned this Apr 25, 2025
mathomp4 marked this pull request as ready for review May 2, 2025 21:00
mathomp4 requested a review from a team as a code owner May 2, 2025 21:00
sdrabenh merged commit aa19b05 into develop May 23, 2025
12 checks passed
mathomp4 deleted the feature/network-congestion branch May 23, 2025 13:37
